AI-based remixing of music: timbre transformation and matching of mixed audio data

ABSTRACT

The present invention provides a method for processing audio data, comprising the steps of providing input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre, decomposing the input audio data to provide decomposed data representative of the first audio data, and transforming the decomposed data to obtain third audio data.

The present invention relates to a method and a device for processing music audio data, in particular mixed audio data which contain a mixture of several musical timbres.

Various types of audio devices are known which process and play music audio data, for example music players, DJ systems and Digital Audio Workstations (DAWs). Conventional audio devices receive music audio data, such as stereo audio files, from a music service provider via the Internet or from a local storage medium, for example a CD. In other applications, audio data containing music are received from an analog playback device or directly from a live input/recording through an analog/digital converter.

It is also known to process and modify music audio data for artistic reasons or to prepare them for playback or other purposes. For example, music players often allow the application of simple sound effects such as equalizer effects. More sophisticated playback devices, such as DJ systems, provide additional options for transforming and modifying the music, for example by mixing a song with another song or applying loop or reverb effects etc. Moreover, music production equipment or DAW applications provide extensive options for editing music audio data through time shifting or reassembling, changing pitch, applying one or more sound effects and mixing the data with other audio data.

Conventional effects and modifications of audio data are usually unrelated to the actual musical content of the audio data, such as musical instruments, vocal components, melodies or harmonies, all of which form elements of the musical composition. Since most music is provided in the form of mixed stereo audio files, conventional effects may be able to modify the overall sound of a piece of music, but not the musical composition as such. On the other hand, if the audio data are manipulated in a more disruptive manner, such as by loop roll/beat masher effects or by mixing them with the audio data of another piece of music, it is usually difficult to preserve the musical character and/or the flow of the original piece of music.

One conventional approach to modify music while preserving the musical character is disclosed in “DDSP: Differentiable Digital Signal Processing”, Jesse Engel et al., paper at ICLR 2020. This paper reports on transforming the musical timbre of a piece of music into different musical timbres, such as to convert, for example, a singing voice into a violin timbre of the same melody. However, such a concept cannot easily be applied in practice, because most pieces of music contain a mixture of different musical timbres originating from recording a mixture of different musical instruments, singing voices, etc. The timbre transfer of the conventional type would then lead to artefacts or, if at all, would convert a piece of music into a single-timbre melody line based on the most prominent melody of the original piece of music.

It was therefore an object of the present invention to provide a method and a device for processing audio, which allow a modification of input audio data containing a mixture of different musical timbres, while preserving a musical character.

According to a first aspect of the present invention, this object is achieved by a method for processing audio data, comprising the steps of providing input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre, decomposing the input audio data to provide decomposed data representative of the first audio data, and transforming the decomposed data to obtain third audio data, wherein transforming the decomposed data includes at least one of (1) changing musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, and (2) changing melody such that the third audio data represent a melody different from that of the decomposed data.

According to an important feature of the invention, the method therefore includes a step of decomposing the input audio data to provide at least first decomposed data representative of the first audio data, i.e. audio data of the first musical timbre, wherein the modification is applied individually to the decomposed data, such as to only affect a certain timbre. The other musical timbres that may be included in the mixed input audio data may remain unchanged or may be modified through other individual effects or transformations. As a result, a modification of the audio data can take the elements of the musical composition into account, in particular the individual musical timbres originating from original musical instruments or singing voices etc. This provides a variety of new ways to modify music to achieve certain effects while preserving the basic character of the music.

As a mere example, if the input audio data contain a mixture of a singing voice timbre, a guitar timbre and a drum timbre, the method of the invention may decompose the input audio data to isolate the singing voice timbre as the decomposed data, and may then transform the isolated singing voice timbre into a piano timbre, as the third audio data. As a result, third audio data are obtained, in which the singing voice timbre is substituted by a piano timbre.

It should be noted that, in the context of the present disclosure, decomposing the input audio data refers to separating or isolating specific timbres from other timbres, which in the original input audio data were mixed in parallel, i.e. overlapped on the time axis, such as to be played together within the same time interval. Likewise, it should be noted that recombining or mixing of audio data or tracks refers to overlapping in parallel, summing, downmixing or simultaneously playing/combining corresponding time intervals of the audio data or tracks, i.e. without shifting the audio data or tracks relative to one another with respect to the time axis.

Furthermore, in the context of the present disclosure, input audio data containing a mixture of audio data are representative of audio signals obtained from mixing a plurality of source tracks, in particular during music production or during recording of a live musical performance of instrumentalists and/or vocalists. Thus, input audio data may usually originate from a previous mixing process that has been completed before the start of the processing of audio data according to the present invention. In particular, the mixed audio data may be included in audio files together with meta data, for example in audio files containing a piece of music that has been produced in a recording studio by mixing a plurality of source tracks of different timbres. For example, a first source track may be a vocal track (vocal timbre) obtained from recording a vocalist via a microphone, while a second source track may be an instrumental track (instrumental timbre) obtained from recording an instrumentalist via a microphone or via a direct line signal from the instrument or via MIDI through a virtual instrument. Usually, a plurality of such tracks are recorded at the same time or one after another. The plurality of source tracks are then transferred to a mixing station, wherein the source tracks are individually edited, various sound effects and individual volume levels are applied to the source tracks, all source tracks are mixed in parallel, and preferably one or more mastering effects are eventually applied to the sum of all tracks. At the end of the production process, the final audio mix, usually a stereo mix, is stored in a suitable recording medium, for example in an audio file on the hard drive of a computer. Such audio files preferably have a conventional compressed or uncompressed audio file format, such as MP3, WAV, AIFF or other, in order to be readable by standard playback devices, such as computers, tablets, smartphones or DJ devices. 
For processing according to the present invention, the input audio data may then be provided as audio files by reading the files from local storage means, receiving the audio files from a remote server, for example via streaming through the Internet, or in any other manner.

Input audio data according to the present invention usually represent stereophonic audio signals and are thus provided in the form of stereo audio files, although other types, such as mono audio files or multichannel audio files may be used as well.

Thus, input audio data include a mixture of audio data of different musical timbres, wherein the timbres originate from different sound sources, such as different musical instruments, different software instruments or samples, different voices etc. In particular, a certain timbre may refer to at least one of:

-   a recorded sound of a certain musical instrument (such as a bass, piano, drums (including classical drum set sounds, electronic drum set sounds, percussion sounds), guitar, flute, organ etc.) or any group of such instruments;
-   a synthesizer sound that has been synthesized by an analog or digital synthesizer, for example to resemble the sound of a certain musical instrument (such as a bass, piano, drums (including classical drum set sounds, electronic drum set sounds, percussion sounds), guitar, flute, organ etc.) or any group of such instruments;
-   a sound of a vocalist (such as a singing or rapping vocalist) or a group of vocalists;
-   any combination thereof.

These timbres relate to specific frequency components and distributions of frequency components within the spectrum of the audio data as well as temporal distributions of frequency components within the audio data, and they may be separated through an AI system specifically trained with training data containing these timbres, as will be explained in more detail later.

According to an embodiment of the present invention, transforming the decomposed data includes changing musical timbre, wherein the third audio data (the transformed data) and the decomposed data represent musical tones of the same melody or no melody. Therefore, a musical character can be readily preserved by keeping the melody constant and just changing the timbre of the decomposed data.

In the present disclosure, a musical tone may have a pitch value selected from the known chromatic scale, i.e. from C, C#, D, D#, E, F, F#, G, G#, A, A#, H, C′, C#′, D′, . . . etc., also known as the 12-tone equal temperament scale according to the most common tuning in Western music. A pitch value refers to the lowness or highness of the musical tone, i.e. to the audible frequency of the sound produced by playback of the associated audio data of the musical tone, preferably based on a tuning standard using 440 Hz for the musical tone A above middle C as a reference note, wherein the frequency of the sound doubles for each octave up. In particular, pitch values according to the present disclosure refer to the equal temperament tuning system as used in Western music. Furthermore, a musical tone has a certain starting time with respect to the time axis of the song. In addition, a musical tone has a certain ending time or a certain duration with respect to the time axis of the song. A melody may then be defined by a certain sequence of musical tones. Alternatively, a melody could be understood as a pitch progression over a certain time interval.
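The relation between a pitch value and its audible frequency described above can be sketched as follows. This is a minimal illustration and not part of the claimed method; it assumes MIDI note numbering (A above middle C is note 69) purely for convenience, the function names are hypothetical, and the pitch written H in the text is labeled B in the code, following English-language convention.

```python
# Illustrative only: 12-tone equal temperament with A above middle C at 440 Hz.
# MIDI note numbers (A4 = 69) are an assumption made for this sketch.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_frequency(note: int, a4_hz: float = 440.0) -> float:
    """Frequency in Hz of a note; each semitone is a factor of 2**(1/12)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def note_name(note: int) -> str:
    """Pitch-class name plus octave number, with note 60 written as C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


if __name__ == "__main__":
    for n in (60, 69, 81):
        print(note_name(n), round(midi_to_frequency(n), 2))
```

Raising a note by 12 semitones doubles its frequency, which matches the statement that the frequency doubles for each octave up.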

Two melodies may however still be regarded as equal or unchanged if all tones are just transposed by one or more octaves up or down. Thus, if substantially each musical tone of the third audio data has a corresponding musical tone in the decomposed data, which has the same timing values (to be played at substantially the same point in time and for the same duration), but has a pitch value that is shifted with respect to that of the corresponding musical tone of the third audio data by one or more octaves up or down, the resulting musical tones of the third audio data should still be regarded as defining the same melody as the first audio data. Furthermore, both the decomposed data and the third audio data may have substantially no melody at all, in particular may have substantially constant pitch values, such as in the case of a drum timbre or a rap/spoken vocal timbre, for example.

Moreover, parts of the audio data which do not have a particular pitch or for which a particular pitch is not clearly recognizable (such as in the case of drums), are not to be regarded as forming part of a melody. Thus, if the decomposed data and the third audio data substantially only differ from one another in such parts for which a clear musical tone with a clear pitch value cannot be determined, the melodies of the first audio data and the third audio data are regarded as being equal or having no melody.
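The notion of two melodies being equal up to octave transposition, with unpitched parts ignored, can be illustrated by the following sketch. The tone representation (start time, duration, MIDI pitch or `None` for unpitched sounds), the timing tolerance and the function name are assumptions introduced for this example only. The sketch allows each tone to be shifted by its own whole number of octaves and expects both lists to be sorted by start time.

```python
from typing import List, Optional, Tuple

# (start_seconds, duration_seconds, midi_pitch); pitch is None for sounds
# without a clearly recognizable pitch, such as drum hits.
Tone = Tuple[float, float, Optional[int]]


def same_melody(a: List[Tone], b: List[Tone], tol: float = 0.05) -> bool:
    """True if the pitched tones of a and b agree in timing and pitch class."""
    pitched_a = [t for t in a if t[2] is not None]
    pitched_b = [t for t in b if t[2] is not None]
    if len(pitched_a) != len(pitched_b):
        return False
    for (start_a, dur_a, pitch_a), (start_b, dur_b, pitch_b) in zip(pitched_a, pitched_b):
        # Timing values must be substantially the same.
        if abs(start_a - start_b) > tol or abs(dur_a - dur_b) > tol:
            return False
        # Pitches may differ only by whole octaves (multiples of 12 semitones).
        if (pitch_a - pitch_b) % 12 != 0:
            return False
    return True
```

Two inputs that contain no pitched tones at all compare as equal, which corresponds to the case of both having substantially no melody.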

As noted above, transforming the decomposed data may include changing musical timbre. In addition or alternatively, transforming the decomposed data may include changing the melody, such that the third audio data (the transformed audio data) have a melody different from that of the decomposed data. In particular, since the input audio data and thus the decomposed data contains music defined by a temporal sequence of musical tones, these tones and in particular their pitch values and timing values (time and duration) define a melody of the music. The step of transforming the decomposed data may include modifying some or all of the musical tones such as to modify the melody. The melody is, however, not completely substituted by an unrelated, different musical sequence, but is modified on the basis of the original melody, for example using the same or similar set of musical tones but varied in time or order, such that a musical character of the original input audio data can be preserved.

In a further embodiment of the present invention, the third audio data may have the same time-dependent harmony or the same key as the first audio data or as the decomposed data. In other words, the step of transforming may preserve the harmony. The harmony of the music is a time-dependent parameter, which may also be understood as the chord progression within the music. The harmony of the music at a specific point in time may in particular be defined by a certain chord and root tone, such as for example C Major, C Major 7, A Minor etc. The key of the music is basically constant over the whole piece of music and relates to the root or key note of the tonic of the piece of music. If the first and third audio data have the same harmony (the same chord progression) or at least the same key, the musical character can be preserved while at the same time, for example the melody and/or the musical timbre may change.

Transforming the decomposed data while preserving harmony may be achieved by substituting the decomposed data with third audio data obtained from an audio track library.

An audio track library may include a plurality of candidate audio tracks, from which a user or a selection algorithm may select one of the tracks as a selected audio track. The selection algorithm may be configured to automatically select a candidate audio track or automatically propose a candidate audio track to the user for selection. Automatic selection or proposal may be based on musical parameters of the candidate audio tracks, such as tonic key (home key), time-dependent harmony, beat, tempo, melody and timbre, wherein the parameters may either be detected by the selection algorithm from a data analysis of the audio track or may be included in meta data stored together with the candidate audio tracks or stored within the audio tracks in the library. In this manner, the selection algorithm may for example automatically select a candidate audio track which has the same timbre as the decomposed data or the input audio data.
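A selection algorithm of the kind described above could, as a rough sketch, rank candidate audio tracks by how well their stored musical parameters agree with those of the decomposed data. The `TrackInfo` fields, the ranking order (timbre first, then key, then tempo distance) and all names below are assumptions chosen for illustration; the disclosure does not prescribe a particular ranking.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrackInfo:
    """Hypothetical meta data stored with a candidate audio track."""
    name: str
    key: str          # tonic key, e.g. "A minor"
    tempo_bpm: float
    timbre: str       # e.g. "piano", "vocals"


def select_candidate(library: List[TrackInfo], target: TrackInfo) -> Optional[TrackInfo]:
    """Pick the candidate that best matches the target's timbre, key and tempo."""
    def score(candidate: TrackInfo):
        # Tuples compare element-wise: timbre match outranks key match,
        # which outranks closeness in tempo.
        return (
            candidate.timbre == target.timbre,
            candidate.key == target.key,
            -abs(candidate.tempo_bpm - target.tempo_bpm),
        )
    return max(library, key=score, default=None)
```

The same function could equally be used to propose a candidate to the user rather than selecting it automatically.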

The third audio data may be directly formed by the selected audio track or by applying one or more sound effects to the selected audio track. Alternatively, the third audio data may be obtained from the selected audio track through a matching processing which changes at least one musical parameter of the selected audio track, such as tonic key (home key), time-dependent harmony, beat, tempo (beats per minute), melody and timbre, in such a manner as to match the corresponding musical parameters of the decomposed data or the corresponding musical parameters of the input audio data. The matching processing may use key/pitch matching, tempo matching or beat matching algorithms conventionally known as being included in DJ systems or DAW applications, for example.
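For the matching processing, the two quantities typically needed are a tempo ratio and a key shift. The following sketch computes both under simple assumptions: tempo is given in beats per minute, keys are given as pitch classes 0 to 11 (C = 0), and the shift is chosen as the smallest transposition in semitones. The function names are hypothetical, and the actual time stretching and pitch shifting of audio are left to conventional algorithms.

```python
def tempo_stretch_ratio(source_bpm: float, target_bpm: float) -> float:
    """Playback speed factor that brings source_bpm to target_bpm."""
    if source_bpm <= 0 or target_bpm <= 0:
        raise ValueError("tempo values must be positive")
    return target_bpm / source_bpm


def key_shift_semitones(source_key_pc: int, target_key_pc: int) -> int:
    """Smallest transposition (in the range -5..+6) from source to target key."""
    shift = (target_key_pc - source_key_pc) % 12
    return shift - 12 if shift > 6 else shift
```

For example, a selected track in C (pitch class 0) that is to match decomposed data in A (pitch class 9) would be transposed down by three semitones rather than up by nine.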

In another embodiment of the invention, transforming the decomposed data while preserving harmony may be achieved by substituting the decomposed data with third audio data obtained from an audio track generator. In a simple embodiment, such audio track generator is a random generator generating random musical tones to be played with a predetermined timbre through a synthesizer or sample player. In a more sophisticated embodiment, a music generation algorithm may be used such as known from “Jukebox: A Generative Model for Music”, Prafulla Dhariwal et al., arXiv:2005.00341v1 [eess.AS] 30 Apr. 2020.

The third audio data may be directly formed by the generated audio track obtained from the audio track generator or by applying one or more sound effects to the generated audio track. Alternatively, the third audio data may be obtained from the generated audio track through a matching processing which changes at least one musical parameter of the generated audio track, such as tonic key (home key), time-dependent harmony, beat, tempo, melody and timbre, in such a manner as to match the corresponding musical parameter of the decomposed data or the corresponding musical parameter of the input audio data. The matching processing may use key/pitch matching, tempo matching or beat matching algorithms conventionally known as being included in DJ systems or DAW applications, for example. When the third audio data and the decomposed data have the same timbre but different melodies, the third audio data may be regarded as a variation of the decomposed data. For example, a certain piano melody may be exchanged by a different piano melody, preferably by a piano melody having the same harmony (or chord progression) but different musical tones than the original piano melody.

If transforming the decomposed data includes changing melody, the method may further include the steps of detecting a time-dependent musical harmony of the decomposed data (or the input audio data), and generating, based on the time-dependent musical harmony, pitch data for a plurality of individual musical tones of the third audio data, which are to be played sequentially at respective predetermined points in time, such as to generate a melody of the third audio data. Thus, a new melody is generated by generating musical tones, wherein the time-dependent musical harmony of the original input audio data is taken into account. In particular, the musical tones of the third audio data are generated in such a manner that the time-dependent musical harmony of the third audio data is the same as the time-dependent musical harmony of the input audio data or the decomposed data (which are usually equal). This can be achieved, for example, if more than 60%, preferably more than 80% of all musical tones generated at specific points in time such as to form the melody of the third audio data are musical tones that fit the respective musical harmony at those respective points in time. Therein, a musical tone to be played at a specific point in time may be regarded as fitting the particular musical harmony at that specific point in time, if a musical pitch of the musical tone belongs to a musical scale of the musical harmony, such as a major scale, a minor scale, a blues scale or any other scale defined in music theory and associated with the particular music contained in the input audio data. Further criteria may be applied for selecting musical tones from among the available musical tones, such as avoiding larger intervals or larger deviations from the beat of the music. Within the framework of such criteria, the particular selection of musical tones may be based on any suitable function or on a random generator.
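A very simple instance of such harmony-aware melody generation is sketched below. It assumes, for illustration only, that the detected harmony is given as one root pitch class per beat, that each harmony is associated with a major scale on that root, and that "avoiding larger intervals" means limiting each step to `max_step` semitones (at least 1). Tones are chosen by a seeded random generator; all names are hypothetical.

```python
import random
from typing import List, Sequence

MAJOR_SCALE = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets from the root


def fits_harmony(pitch: int, root_pc: int, scale: Sequence[int] = MAJOR_SCALE) -> bool:
    """True if the pitch belongs to the scale built on the given root."""
    return (pitch - root_pc) % 12 in scale


def generate_melody(harmony: List[int], start_pitch: int = 60,
                    max_step: int = 4, seed: int = 0) -> List[int]:
    """Return one MIDI pitch per beat that fits the harmony at that beat."""
    rng = random.Random(seed)
    melody, current = [], start_pitch
    for root_pc in harmony:
        # Candidate tones: close to the previous tone and inside the scale.
        candidates = [p for p in range(current - max_step, current + max_step + 1)
                      if fits_harmony(p, root_pc)]
        current = rng.choice(candidates)
        melody.append(current)
    return melody


def fit_ratio(melody: List[int], harmony: List[int]) -> float:
    """Fraction of tones that fit the harmony at their point in time."""
    hits = sum(fits_harmony(p, r) for p, r in zip(melody, harmony))
    return hits / len(melody)
```

Because every generated tone is drawn from the scale of the current harmony, `fit_ratio` is 1.0 for melodies produced by this sketch; a generator that also admits passing tones would be checked against the 60% or 80% thresholds mentioned above.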

An example of an algorithm for changing melody and generating musical tones according to a predetermined musical harmony progression, which may be used in embodiments of the present invention, is disclosed in “Bebopnet: Deep Neural Models for Personalized Jazz Improvisations”, Shunit Haviv Hakimi et al., proceedings of the 21st ISMIR Conference.

According to another embodiment of the present invention, the method may further include pitch detection for detecting pitch data indicating the musical pitch of the decomposed data or the first audio data. Pitch data include time-dependent pitch values, such as frequencies or musical pitch values as defined above, i.e. from C, C#, D, D#, E, F, F#, G, G#, A, A#, H, C′, C#′, D′, for different points in time along the time axis. Pitch detection algorithms are known as such in the prior art and usually operate on the basis of an analysis of the frequency spectrum of the audio data in order to find dominating frequency portions which determine the pitch or melody of the music. Conventional pitch detection algorithms are known to be integrated in pitch detection plugins for DAW systems, for example. Another example of a conventional pitch detection algorithm that may be used in embodiments of the present invention is disclosed by Jong Wook Kim et al., “Crepe: A Convolutional Representation for Pitch Estimation”, arXiv:1802.06182 [eess.AS]. In addition or alternatively, the decomposed data as obtained from the step of decomposing the input audio data may as such include pitch data indicating the musical pitch of the first audio data. Based on pitch data obtained or detected in this manner, the transformation of the decomposed data to obtain the third audio data can be carried out while preserving the melody or the harmony progression of the original input audio data.
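As a rough illustration of pitch detection on a single short frame of monophonic audio, the sketch below uses time-domain autocorrelation. Note that this is a deliberately simple stand-in and not the spectrum-based or neural approaches referred to above; the function name, the search range of 80 Hz to 1000 Hz and the integer-lag resolution are assumptions of this example. The frame must contain more samples than `sample_rate / fmin`.

```python
import math
from typing import List


def detect_pitch_hz(samples: List[float], sample_rate: int,
                    fmin: float = 80.0, fmax: float = 1000.0) -> float:
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The best-matching delay approximates one period of the tone.
    return sample_rate / best_lag


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2.0 * math.pi * 220.0 * n / sr) for n in range(1024)]
    print(detect_pitch_hz(tone, sr))
```

Applying such an estimator to consecutive frames yields time-dependent pitch values; because lags are whole samples, the estimate is only approximate.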

In a further embodiment of the present invention, the method further comprises a step of converting the decomposed data into event message data formed by a plurality of event messages of musical tones, wherein each event message at least specifies pitch data and velocity data of the musical tone, and wherein said event message data are preferably formatted according to the MIDI standard. The event message data may in particular be formed by event messages of the same length on the time axis such that the entire audio is represented by a sequence of such event messages, which define, for each point in time, the pitch and the velocity of the musical tone that is to be played at this point in time. This means that a longer tone, for example, will be represented by a plurality of subsequent event messages having equal pitch data. Silence may be represented by empty event messages or event messages having zero velocity. If such event message data are formatted according to the MIDI standard, this will offer a variety of ways to play the event message data through electronic musical instruments which support MIDI.
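The representation by equal-length event messages can be sketched as follows. The `EventMessage` type, the note tuple layout (start, duration, pitch, velocity) and the function name are hypothetical; the sketch is monophonic and uses MIDI-style value ranges, but it is not an encoder for the MIDI wire or file format.

```python
from typing import List, NamedTuple, Tuple


class EventMessage(NamedTuple):
    pitch: int     # MIDI-style note number
    velocity: int  # 0..127, where 0 represents silence


def notes_to_frames(notes: List[Tuple[float, float, int, int]],
                    total_s: float, frame_s: float) -> List[EventMessage]:
    """Convert (start, duration, pitch, velocity) notes to equal-length frames."""
    n_frames = round(total_s / frame_s)
    frames = [EventMessage(0, 0)] * n_frames  # start with silence everywhere
    for start, duration, pitch, velocity in notes:
        first = round(start / frame_s)
        last = round((start + duration) / frame_s)
        # A longer tone occupies several consecutive frames with equal pitch.
        for i in range(first, min(last, n_frames)):
            frames[i] = EventMessage(pitch, velocity)
    return frames
```

With a frame length of 0.25 s, a half-second tone is represented by two consecutive event messages of equal pitch, and gaps between tones remain zero-velocity messages.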

Based on the pitch data or the event message data obtained as described above, third audio data may be generated in the step of transforming the decomposed data by a synthesizer-based or sample-based generation of audio signals. In particular, a synthesizer or a sample player known as such may be operated based on the pitch data or the event message data to obtain the third audio data having a different timbre or a different melody than the decomposed data. This makes it possible to rely upon a variety of synthesizer sounds and sample sounds available on the market, such as to have easy access to a plurality of different timbres.
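As a minimal stand-in for a synthesizer operated from event message data, the sketch below renders a sequence of equal-length (pitch, velocity) frames as a sine tone. A real embodiment would use a synthesizer or sample player with a richer timbre; the sine oscillator, the default sample rate of 8000 Hz and the function name are assumptions made only to keep the example short.

```python
import math
from typing import List, Tuple


def synthesize(frames: List[Tuple[int, int]], frame_s: float,
               sample_rate: int = 8000) -> List[float]:
    """Render (pitch, velocity) frames of equal length as a sine wave."""
    samples_per_frame = round(frame_s * sample_rate)
    out: List[float] = []
    phase = 0.0
    for pitch, velocity in frames:
        freq = 440.0 * 2.0 ** ((pitch - 69) / 12.0)  # equal temperament
        amp = velocity / 127.0                        # zero velocity = silence
        step = 2.0 * math.pi * freq / sample_rate
        for _ in range(samples_per_frame):
            out.append(amp * math.sin(phase))
            phase += step  # phase is carried across frames to avoid clicks
    return out
```

Swapping the oscillator for a different sound source changes the timbre of the output while the melody defined by the frames stays the same.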

In a further embodiment of the present invention, the step of transforming the decomposed data may involve processing of audio data obtained from the decomposed data within an artificial intelligence system, preferably a trained neural network. The artificial intelligence system (AI system) may be of the same type as disclosed in “DDSP: Differentiable Digital Signal Processing”, Jesse Engel et al., paper at ICLR 2020. In particular, a neural network may be used that has been trained with training data, wherein different sets of training data contain different musical timbres and wherein pitch information, melody data or other data related to the music associated with the audio data are input into the network in association with the music as such, in order to allow the neural network to learn about the correlation between the music audio data and the musical parameters or the components of the musical composition.

Decomposing the input audio data may be carried out by an analysis of the frequency spectrum of the input audio data and identifying characteristic frequencies of certain musical instruments or vocals, for example based on a Fourier transformation of audio data obtained from the input audio data. In a preferred embodiment of the present invention, the step of decomposing the input audio data involves processing of audio data obtained from the input audio data within an artificial intelligence system (AI system), preferably a trained neural network. In particular, an AI system may implement a convolutional neural network (CNN), which has been trained by a plurality of data sets for example including a vocal track, a harmonic/instrumental track and a mix of the vocal track and the harmonic/instrumental track. Examples of conventional AI systems capable of separating source tracks such as a singing voice track from a mixed audio signal include: Prétet, “Singing Voice Separation: A study on training data”, Acoustics, Speech and Signal Processing (ICASSP), 2019, pages 506-510; “spleeter”, an open-source tool provided by the music streaming company Deezer based on the teaching of Prétet above; “PhonicMind” (https://phonicmind.com), a voice and source separator based on deep neural networks; “Open-Unmix”, a music source separator based on deep neural networks in the frequency domain; and “Demucs” by Facebook AI Research, a music source separator based on deep neural networks in the waveform domain. These tools accept music files in standard formats (for example MP3, WAV, AIFF) and decompose the song to provide decomposed/separated tracks of the song, for example a vocal track, a bass track, a drum track, an accompaniment track or any mixture thereof.

In a further embodiment of the present invention, the step of decomposing the input audio data provides first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and the method further comprises a step of recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data. Thus, the input audio data are decomposed into two components of different musical timbres, one of the components being transformed by changing the musical timbre and/or the melody, and afterwards the components are recombined again. The recombined audio data therefore includes original and transformed components at the same time and consequently has a similar musical character as the original piece of music. Specifically, the recombined audio data may be played through a playback device, further processed, stored or transmitted to another device as a new, separate piece of music.

In a further embodiment of the invention, the step of recombining may include recombining audio data obtained from the third audio data with audio data obtained from the first decomposed data. Thus, a mixture of the first decomposed data before and after the step of transforming may be obtained, which can be regarded as a wet/dry mixture, which for example may be obtained from recombining the output of a transforming unit (the third audio data) with audio data bypassing the transforming unit (the first decomposed data). According to a wet/dry ratio, which may be set by a user through a wet/dry control element, preferably a single control element to be controlled by a single movement, individual volume levels may be assigned to the third audio data and to the first decomposed data, respectively, and recombination may be carried out using these volume levels.
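The wet/dry recombination can be sketched as a sample-wise weighted sum. The linear gain law (dry gain `1 - ratio`, wet gain `ratio`) and the function name are assumptions of this example; other gain curves, such as equal-power curves, would fit the description equally well.

```python
from typing import List


def wet_dry_mix(dry: List[float], wet: List[float], ratio: float) -> List[float]:
    """Mix untransformed (dry) and transformed (wet) audio; ratio 0 = all dry."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    # A single control value sets both volume levels at once.
    return [(1.0 - ratio) * d + ratio * w for d, w in zip(dry, wet)]
```

Here `dry` corresponds to the first decomposed data bypassing the transforming unit and `wet` to the third audio data, so a single control element can move continuously between the original and the transformed timbre.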

In a further embodiment of the present invention, the step of decomposing the input audio data provides a plurality of sets of decomposed data, wherein each set of decomposed data represents audio data of a predetermined musical timbre, such that a sum of all sets of decomposed data represents audio data substantially equal to the input audio data. The sets of decomposed data may thus resemble a complete decomposition of the input audio data such that all components of the original audio data may be available for further processing and recombination. Thus, after transforming one or more, but not all of the sets of decomposed data, the transformed data may be recombined with the remaining sets of decomposed data to achieve recombined audio data which contain the whole musical spectrum of the original piece of music with just one or some of the components being transformed while preserving melody, timbre or harmony.
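Recombination after a complete decomposition can be illustrated as follows, assuming (for this sketch only) that each set of decomposed data is a list of samples stored under a timbre name and that recombining means sample-wise summation of time-aligned data. The function and stem names are hypothetical.

```python
from typing import Dict, List


def recombine(stems: Dict[str, List[float]],
              replacements: Dict[str, List[float]]) -> List[float]:
    """Sum all stems sample by sample, substituting any transformed stems."""
    # Use the transformed version where one exists, the original otherwise.
    chosen = {name: replacements.get(name, data) for name, data in stems.items()}
    length = min(len(data) for data in chosen.values())
    return [sum(data[i] for data in chosen.values()) for i in range(length)]
```

Calling `recombine(stems, {})` approximately reproduces the input audio data, while passing a transformed stem in `replacements` yields recombined audio data in which only that component has changed.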

In a further embodiment of the present invention, the input audio data are provided in the form of at least one input track formed by a plurality of audio frames, the input track having an input track length and each audio frame having an audio frame length, i.e. a predetermined duration based on the sample rate, and wherein the step of decomposing the input audio data comprises decomposing a plurality of input track segments each having a length smaller than the input track length and larger than the audio frame length. Decomposing the input audio data segment-wise allows obtaining at least parts of the results, i.e. segments of the transformed third audio data, faster than in a case where the method would wait for the entire input track to be processed completely. Thus, decomposing the plurality of input track segments may obtain a plurality of decomposed track segments, wherein transforming the decomposed data may be based on the plurality of decomposed track segments to obtain a plurality of third audio track segments, wherein a first segment of the third audio track segments is obtained before a second segment of the input track segments is being decomposed. Therefore, the third audio data may be obtained simultaneously to the processing of the input audio data, i.e. in parallel to the steps of decomposing and transforming. If the entire process of decomposing and transforming a segment is faster than the real time playback of a segment (of the same length), playback of audio data obtained from the third audio data can be started and carried out without interruptions as soon as a first segment of the input track has been decomposed and transformed.
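Segment-wise processing can be sketched with a generator that decomposes and transforms one input track segment at a time, so that the first segment of the third audio data is available before the second input segment is touched. The `decompose` and `transform` callables are placeholders supplied by the caller, and all names are hypothetical.

```python
from typing import Callable, Iterator, List

Segment = List[float]


def process_in_segments(track: Segment, segment_len: int,
                        decompose: Callable[[Segment], Segment],
                        transform: Callable[[Segment], Segment]) -> Iterator[Segment]:
    """Yield transformed segments one by one instead of processing the whole track."""
    for start in range(0, len(track), segment_len):
        segment = track[start:start + segment_len]
        # Each segment is fully processed and handed out before the next
        # segment is decomposed.
        yield transform(decompose(segment))
```

Playback can start as soon as the first yielded segment arrives; it continues without interruption as long as processing one segment takes less time than playing it back.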

More preferably, the generation of at least a segment of the third audio data is completed within a processing time of less than 10 seconds, preferably less than 5 seconds after providing the input audio data or a segment of the input audio data associated with the segment of the third audio data, such as to allow application of the method in a live situation, for example by a DJ, who may spontaneously decide to play a variation of a piece of music with one of the timbres being modified.

In a further preferred embodiment of the present invention, the step of decomposing the input audio data provides first decomposed data representative of the first audio data and second decomposed data representative of the second audio data, and wherein the method further comprises a step of generating an output track which includes a first output track portion, which is obtained by recombining audio data obtained from the first decomposed data with audio data obtained from the second decomposed data or which includes the input audio data, and a second output track portion which is obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data. Audio track portions herein refer to parts of the audio data, which are not to be played in parallel but sequentially, at different points in time. In other words, along the time axis, audio data representing a piece of music may be divided in several portions corresponding to different time intervals. According to the embodiment described above, the output track may comprise a first output track portion in which first and second decomposed data are recombined or which substantially includes the input audio data, such that the music to be played in the first output track portion is substantially equal to the original music in the corresponding portion of the input audio data. On the other hand, in a second output track portion, the transformed third audio data are recombined with the second decomposed data, such that this portion contains modified music, for example music in which one of the timbres has been substituted by another timbre. In this way, the output track may have original portions and modified portions and playback of the output track may thus switch from original to modified versions of the music during playback.

The output track may preferably include a transition portion between the first output track portion and the second output track portion, wherein, within the transition portion, in a direction from the first output track portion towards the second output track portion, a first volume level associated to the audio data obtained from the first decomposed data decreases and a second volume level associated to the audio data obtained from the third audio data increases. In this way, cross-fading or blending over may be possible from the first output track portion to the second output track portion or vice versa, for example by fading out the original timbre and fading in the modified timbre or vice versa, or by fading out the original melody and fading in the modified melody or vice versa.
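A transition portion of this kind can be illustrated with a simple linear fade. The sketch below is an assumption for illustration only: the function names are hypothetical, a linear gain curve is just one possible choice, and the accompaniment (audio obtained from the second decomposed data) is assumed to pass through unchanged.

```python
from typing import List, Tuple


def crossfade_gains(n: int) -> Tuple[List[float], List[float]]:
    """Return (fade_out, fade_in) linear gain curves over n samples."""
    if n < 2:
        raise ValueError("a transition needs at least two samples")
    fade_in = [i / (n - 1) for i in range(n)]   # rises from 0.0 to 1.0
    fade_out = [1.0 - g for g in fade_in]       # falls from 1.0 to 0.0
    return fade_out, fade_in


def render_transition(
    original: List[float],       # audio obtained from the first decomposed data
    transformed: List[float],    # audio obtained from the third audio data
    accompaniment: List[float],  # audio obtained from the second decomposed data
) -> List[float]:
    """Fade the original timbre out and the transformed timbre in."""
    n = len(original)
    if len(transformed) != n or len(accompaniment) != n:
        raise ValueError("all tracks must have the same length")
    fade_out, fade_in = crossfade_gains(n)
    return [
        fade_out[i] * original[i] + fade_in[i] * transformed[i] + accompaniment[i]
        for i in range(n)
    ]


transition = render_transition([1.0] * 5, [2.0] * 5, [0.0] * 5)
```

Reversing the two gain curves gives the opposite direction, i.e. a fade from the modified timbre back to the original one.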

In a further embodiment of the present invention, the input audio data are provided as first input audio data containing a first piece of music, wherein the method further comprises the steps of providing second input audio data containing a second piece of music different from the first piece of music, mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and playback of audio data obtained from the mixed audio data. Therein, audio data obtained from the first input audio data may in particular be the third audio data obtained through the transforming step or decomposed data. This method is in particular applicable in a DJ system, which is configured to mix two different pieces of music or cross-fade between the two pieces of music, and finally play the mixed audio data. Modifying one of the two sets of input audio data by a method according to the present invention may allow a DJ to create remixes, mixes or cross-fades, which sound smoother or more interesting. For example, the DJ could substitute a vocal component of a first song with a piano component having the same melody, and then cross-fade from the first song towards the second song. The advantage is that a vocal component of the second song would then be prevented from colliding with the vocal component of the first song during a specific time interval. In fact, mixing two different vocal components of two different songs may often lead to an irritating sound and should hence be avoided. Instead, combining a vocal component of one song with a piano component of another song achieves better results and produces fewer sound clashes, provided of course that keys and/or beats/tempo of the two songs are matched to one another as conventionally known for DJ systems.

In the method as described before, which processes first input audio data and second input audio data, the second input audio data may contain a mixture of audio data including fourth audio data of a fourth musical timbre and fifth audio data of a fifth musical timbre, wherein the method may further comprise the steps of decomposing the first input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and decomposing the second input audio data to provide fourth decomposed data representative of the fourth audio data, and fifth decomposed data representative of the fifth audio data.

To assist cross-fades between the two pieces of music, the step of transforming the first decomposed data to obtain the third audio data may include changing the musical timbre such that the third musical timbre substantially equals a musical timbre of the second input audio data, in particular the fourth musical timbre or the fifth musical timbre. This means that timbre matching is carried out between the first and second pieces of music by adapting the musical timbre of the decomposed audio data of the first piece of music to a musical timbre of the second piece of music. For this purpose, the musical timbre of the second input audio data may be obtained from the step of decomposing the second input audio data. For example, if the step of decomposing the second input audio data reveals a significant piano component within the second input audio data, the first decomposed data, which for example represent a certain melody with a guitar timbre, may be transformed to obtain third audio data, which represent the same melody but with a piano timbre.
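One conceivable way to decide which timbre of the second piece of music is "significant" is to compare the signal levels of its decomposed tracks. The following sketch is an assumption, not a method stated in this description: the RMS criterion, the threshold `min_rms` and the function names are hypothetical.

```python
import math
from typing import Dict, List, Optional


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a track (0.0 for an empty track)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_target_timbre(
    stems: Dict[str, List[float]], min_rms: float = 0.05
) -> Optional[str]:
    """Return the name of the loudest decomposed track, or None if all are quiet."""
    if not stems:
        return None
    name, level = max(
        ((n, rms(x)) for n, x in stems.items()), key=lambda pair: pair[1]
    )
    return name if level >= min_rms else None


# Hypothetical decomposed tracks of the second input audio data.
stems_b = {
    "piano": [0.5, -0.5, 0.5, -0.5],
    "vocals": [0.1, -0.1, 0.1, -0.1],
    "drums": [0.0, 0.0, 0.0, 0.0],
}
target = pick_target_timbre(stems_b)
```

In this example the piano track dominates, so the first decomposed data would be transformed towards a piano timbre.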

In another embodiment of the invention, the method may include transforming the fourth decomposed data to obtain sixth audio data.

Therefore, modification or transformation of one of the timbres according to embodiments of the invention may preferably be applied to each of the first and second input audio data, which allows further adjusting the sound of both sets of audio data for additional creative effects or for further assisting cross-fading or blending over between two songs. In particular, the third audio data and the sixth audio data, i.e. the transformed components of both songs, may be of substantially the same timbre, such that they can be smoothly mixed on top of each other or swapped or cross-faded during a transition from one song to the other.

In a further embodiment of a method processing first and second input audio data, the method may further comprise a step of generating an output track which includes a first output track portion obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data, and a second output track portion obtained by recombining audio data obtained from the sixth audio data with audio data obtained from the fifth decomposed data. Therefore, the output track may contain a first output track portion containing a modified version of the first song, and a second output track portion containing a modified version of the second song, wherein both output track portions are played sequentially, one after the other on the time axis. A transition may be played between the two output track portions in order to smoothly blend over from the first song to the second song, if desired. The modified versions of the two songs may either create a specific creative effect or may be used to let the transition between the two songs sound smoother.

According to a second aspect of the present invention, the above-mentioned object is achieved by a device for processing audio data, comprising an input unit configured to receive input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre, a decomposition unit for decomposing the input audio data to provide decomposed data representative of the first audio data, a transforming unit for transforming the decomposed data to obtain third audio data, wherein the transforming unit includes at least one of (1) a timbre changing unit configured to change musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, and (2) a melody changing unit configured to change melody such that the third audio data have a melody different from that of the decomposed data.

Such a device provides technical means for carrying out a method of the first aspect of the invention and therefore achieves the same effects as described above for the first aspect of the invention. The device of the second aspect may in particular be a computer, a tablet or a smartphone running suitable software, or may be a standalone music processing device such as a DJ device. The device may also be embodied by DAW software or a plugin for DAW software or any other audio processing software.

In an embodiment of the second aspect of the invention, the device may include a pitch detection unit for detecting pitch data indicating the musical pitches of the decomposed data. The pitch detection unit may operate on the basis of one of the pitch detection algorithms described above with respect to the step of pitch detection in the method of the first aspect of the invention. Based on the pitch data, third audio data can more easily be generated by the transforming unit in such a manner as to preserve desired features of the musical character, because the pitch data allow analyzing musical tones, i.e. a melody of the music.

In particular, the device may comprise a synthesizer unit for synthesizer-based generation of audio signals and/or a sample player for sample-based generation of audio signals. The synthesizer unit and/or the sample player may be embodied by virtual units of the application software running on a computer, a tablet or a smartphone. Alternatively, a conventional synthesizer plugin, a sample player plugin or even a conventional hardware synthesizer or standalone sample player may be used, which may be connected to and operated by the device according to pitch data or event message data obtained from the decomposed data, for example through a pitch detection unit.
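As an illustration of how pitch data might be turned into event message data for a synthesizer or sample player, the sketch below converts a frame-wise pitch track into note events. It assumes MIDI-style note numbers (with A4 = 440 Hz mapped to note 69) as the event format; this format, the function names and the frame-based input are assumptions made for the example.

```python
import math
from typing import List, Optional, Tuple

NoteEvent = Tuple[int, float, float]  # (note number, start in s, duration in s)


def frequency_to_midi(freq_hz: float) -> int:
    """Nearest MIDI-style note number for a frequency (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def pitch_track_to_events(
    pitches_hz: List[Optional[float]], frame_seconds: float
) -> List[NoteEvent]:
    """Merge consecutive frames with the same note into note events.

    A frame value of None (or 0.0) is treated as unvoiced, i.e. silence.
    """
    events: List[NoteEvent] = []
    current: Optional[int] = None
    start = 0.0
    # A trailing None acts as a sentinel that closes the last open note.
    for i, freq in enumerate(list(pitches_hz) + [None]):
        note = frequency_to_midi(freq) if freq else None
        if note != current:
            if current is not None:
                events.append((current, start, i * frame_seconds - start))
            current, start = note, i * frame_seconds
    return events


events = pitch_track_to_events([440.0, 440.0, None, 261.63], 0.5)
```

The resulting events could then drive a synthesizer unit or sample player set to the desired substitute timbre.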

The device may further include one or more artificial intelligence systems, preferably trained neural networks as described above in relation to the first aspect of the invention. In particular, at least one of the transforming unit, the timbre changing unit, the melody changing unit, the harmony detection unit, the pitch detection unit, the pitch data generating unit, and the data conversion unit may include such an artificial intelligence system. The at least one artificial intelligence system may preferably be stored in a local memory of the device, such as a RAM, to achieve optimal performance. A first trained neural network may be used by the transforming unit, in particular the timbre changing unit and/or the melody changing unit, wherein a second trained neural network may be used by the decomposition unit.

Further embodiments of the device of the second aspect of the invention may include units and other means adapted to carry out individual steps and features of the embodiments of the method of the first aspect described above.

In particular, the device of the second aspect may be a DJ device having a first input section for receiving first input audio data representing a first piece of music, and a second input section for receiving second input audio data containing a second piece of music different from the first piece of music, wherein the device may further comprise a mixing unit configured for mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and a playback unit for playing audio data obtained from the mixed audio data. The principles of the present invention achieve significant advantages, in particular in DJ devices, because the invention allows spontaneous modification or transformation of music without changing the character of the music.

Preferred embodiments of the present invention will now be described with reference to the accompanying drawings, in which

FIG. 1 shows a diagram illustrating the configuration and function of a device according to an embodiment of the present invention,

FIG. 2 shows a diagram illustrating a method according to an embodiment of the present invention, which may be carried out using a device as illustrated in FIG. 1, and

FIG. 3 shows a user control unit of the device according to FIG. 1.

FIG. 1 shows a device 10 according to an embodiment of the present invention, which includes a number of units and sections, as will be described in the following in more detail, wherein the units and sections are connected to each other to transmit data, in particular audio data containing music. Device 10 may be implemented by a computer, a tablet or a smartphone running a suitable software application. Any input means of device 10 may thus be formed by standard input means, such as a keyboard, a mouse, a touchscreen, an external input device etc. Any output means may be embodied by a display of the device, by internal or external speakers or other output means known as such. Furthermore, any processing means may be formed by the electronic hardware of the computer, tablet or smartphone, such as microprocessors, RAM, ROM, internal or external storage means etc. Alternatively, device 10 may be a standalone DJ device or other dedicated audio equipment configured to process audio data and music in digital format.

Device 10 includes a first input section 12A, which receives a first input track representing a first piece of music, for example a song A, and a second input section 12B configured to receive a second input track representing a second piece of music, for example a song B. Both input sections 12A, 12B may be arranged to directly receive audio data in digital format or may alternatively include an analog-to-digital converter to convert an analog audio signal, for example from a recording of a live concert, from a broadcasting service or from an analog playback device, into digital audio data. Furthermore, first and second input sections 12A, 12B may include a decompression unit for decompressing compressed audio data received as first and second input tracks, for example to decompress audio data received in MP3 format. Audio data output by first and second input sections 12A, 12B are preferably uncompressed audio data, for example containing a predetermined number of audio frames per second according to the sampling rate of the data (usually 44.1 kHz or 48 kHz, for example).

Audio data obtained from the first input track are then transmitted from the first input section 12A to a first decomposition unit 14A. Audio data obtained from the second input track are transmitted from the second input section 12B to a second decomposition unit 14B. First and second decomposition units 14A, 14B may each include an artificial intelligence system having a trained neural network configured to separate different timbres contained in the first and second input tracks, for example a vocal timbre, a piano timbre, a bass timbre or a drum timbre etc. In particular, the decomposition units 14A, 14B may decompose the input tracks into several parallel decomposed tracks, wherein each of the decomposed tracks contains audio data of a specific musical timbre. Both decomposition units 14A, 14B may produce a complete decomposition of the input tracks such that a sum of all decomposed tracks provided by one decomposition unit will result in audio data that are substantially equal to the respective input track.

In the example illustrated in FIG. 1, the first decomposition unit 14A is configured to decompose the first input track to obtain a first decomposed track, a second decomposed track and a third decomposed track. In case of a complete decomposition, the first decomposed track may for example be a vocal track containing the vocal timbre or vocal component of the first input track, the second decomposed track may be a drum track containing audio data representing the drum timbre or drum component of the first input track, and the third decomposed track may contain a sum of all remaining timbres or all remaining components of the first input track, which may for example be a bass timbre in a case where the piece of music only includes vocal, drum and bass components. Likewise, the second decomposition unit 14B, in the example shown in FIG. 1, may be configured to decompose the second input track to obtain a fourth decomposed track, a fifth decomposed track and a sixth decomposed track, wherein in case of a complete decomposition, a sum of the fourth, fifth and sixth decomposed tracks may result in audio data substantially equal to the second input track.

At least one of the decomposed tracks produced by the decomposition units 14A, 14B is then passed through a transforming unit 16A or 16B. In the example shown in FIG. 1, the first decomposed track is passed through the first transforming unit 16A, while the fourth decomposed track is passed through a second transforming unit 16B. Each of the transforming units 16A, 16B may include at least one of a timbre changing unit and a melody changing unit (not illustrated). A timbre changing unit may use a timbre changing algorithm known as such in the prior art, which changes the timbre of an audio track to a specific other timbre, while maintaining the melody of the audio track. Alternatively, a melody changing unit (not illustrated) of the first or second transforming unit 16A, 16B may be operative to change a melody of the first decomposed track or the fourth decomposed track, respectively.

The first transforming unit 16A outputs a first transformed track changed in timbre and/or melody, while the second transforming unit 16B outputs a second transformed track changed in timbre and/or melody.

Device 10 further includes a first recombination unit 18A and a second recombination unit 18B. The first recombination unit 18A is configured to recombine audio data of the several decomposed tracks of the first decomposition unit 14A, while the second recombination unit 18B is configured to recombine the several decomposed tracks of the second decomposition unit 14B. In the present example, first recombination unit 18A receives the first transformed track from the first transforming unit 16A, the first decomposed track that bypassed the first transforming unit 16A and thus is not transformed, the second decomposed track (not transformed) and the third decomposed track (not transformed). The second recombination unit 18B receives the second transformed track from the second transforming unit 16B, the fourth decomposed track that bypassed the second transforming unit 16B and thus is not transformed, the fifth decomposed track (not transformed) and the sixth decomposed track (not transformed). It should be noted that the number and types of decomposed tracks produced by the first and second decomposition units 14A, 14B and then recombined by the first and second recombination units 18A, 18B are merely exemplary and not intended to limit the present invention. There may be more or fewer decomposed tracks and the first and second decomposition units 14A, 14B may produce different numbers and/or types of decomposed tracks. Furthermore, more than one decomposed track may be transformed by a transforming unit and/or the type and parameters of transformation may be different among the several decomposed tracks.

Recombination units 18A, 18B each recombine the input decomposed tracks or transformed tracks by producing a sum signal of the tracks. This means that the decomposed tracks and the transformed tracks are overlaid in parallel to one another and their signals are added at each point in time. Each of the decomposed tracks and the transformed tracks may be assigned an individual volume level, which may be controllable by a user as will be explained later in more detail. Furthermore, in another embodiment of the invention, at least some of the decomposed tracks and the transformed tracks may receive one or more additional audio effects or sound effects, such as a hall effect, an equalizer effect etc. The effects may be controlled by a user as well, if desired.
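The sum signal with individual volume levels can be written down in a few lines. The sketch below is illustrative only; the function name `recombine`, the track names and the default gain of 1.0 for tracks without an explicit volume level are assumptions.

```python
from typing import Dict, List


def recombine(
    tracks: Dict[str, List[float]], volumes: Dict[str, float]
) -> List[float]:
    """Add the tracks sample by sample, each scaled by its own volume level."""
    if not tracks:
        return []
    length = len(next(iter(tracks.values())))
    out = [0.0] * length
    for name, samples in tracks.items():
        if len(samples) != length:
            raise ValueError("all tracks must have the same length")
        gain = volumes.get(name, 1.0)  # assumed default: unity gain
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out


# Hypothetical inputs of a recombination unit: the transformed track replaces
# the untransformed vocal track, whose volume level is set to zero.
recombined = recombine(
    {
        "transformed": [1.0, 1.0],
        "vocal": [0.5, 0.5],
        "drums": [0.25, -0.25],
    },
    {"vocal": 0.0},
)
```

Setting the volume level of the bypassed decomposed track to zero, as in this example, yields a recombined track in which the original timbre is fully substituted.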

First recombination unit 18A produces a first recombined track, while the second recombination unit 18B produces a second recombined track. As can be seen in FIG. 1, in the present example, the first recombined track includes substantially all musical components of the first input track and therefore has a musical character similar to that of the first input track, since only one component of the music (the first decomposed track) has been modified by a timbre change or melody change. Likewise, the second recombined track has the same or a similar musical character as the second input track, because only one component (the fourth decomposed track) has been modified by a timbre change or melody change.

First and second recombined tracks are then introduced into a mixing unit 20 in which they are mixed together in parallel. The first and second recombined tracks may be assigned different volume levels, if desired, which may be set by a user. Furthermore, one or more additional sound effects may be applied to the first recombined track, the second recombined track or to the sum signal output from the mixing unit, i.e. to an output track.

The output track produced by mixing unit 20 is then transmitted to a playback unit 22 for playback, for example through an internal speaker of device 10, headphones connected to device 10 or any other PA device connected to device 10.

In addition, device 10 may include a user control unit 24, which may be configured to control operation and parameters or settings of several elements of the device. In particular, user control unit 24 may be connected to first and second input sections 12A, 12B for allowing a user to select songs as song A and song B, respectively, to decomposition units 14A, 14B for controlling parameters of the decomposition algorithms, to first and second transforming units 16A, 16B for selecting substitute timbres, which replace the original timbre, melody parameters or other settings, to first and second recombination units 18A, 18B for setting volume levels of the transformed tracks and/or the decomposed tracks, and to mixing unit 20 for setting volume levels for the first and second recombined tracks, effect parameters or other settings, for example. Control unit 24 may be embodied by a touch display of the computer, tablet or smartphone, by a keyboard, a mouse or by hardware controllers, including external controllers to be connected to device 10.

FIG. 2 illustrates a method according to an embodiment of the present invention, which may be carried out by using a device 10 as described above with reference to FIG. 1.

In a first step of the method, a first input track and a second input track are provided, which represent different pieces of music, for example different songs A and B. Both input tracks are then decomposed to obtain several decomposed tracks, in the present example a piano track, a bass track and a drum track for song A, and a vocal track, a bass track and a drum track for song B.

In a subsequent step of transforming, one of the decomposed tracks of each song is transformed such as to change timbre. For example, the piano track of song A is transformed into a trumpet track having the same melody as the original piano track, while the bass track and the drum track remain unchanged. Furthermore, the vocal track of song B is transformed into a trumpet track of the same melody as the original vocal track, while the bass track and the drum track of song B remain unchanged as well.

In a subsequent step of recombining, the transformed tracks and the (not transformed) decomposed tracks of each song are recombined. For example, the transformed trumpet track of song A is recombined with the bass track and the drum track of song A such as to obtain a first recombined track. At the same time, the transformed trumpet track of song B is recombined with the decomposed bass track and the decomposed drum track of song B such as to obtain a second recombined track.

In a subsequent step of mixing, the first recombined track and the second recombined track are mixed together in parallel such as to obtain an output track which may be played through a playback unit.
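The four steps of FIG. 2 (decomposing, transforming, recombining, mixing) can be traced with a toy data flow. In the sketch below the decomposition is assumed to have already produced the named tracks, and `transform_timbre` merely relabels a track instead of resynthesizing audio; all function names and sample values are hypothetical placeholders.

```python
from typing import Dict, List

Stems = Dict[str, List[float]]


def transform_timbre(stems: Stems, source: str, target: str) -> Stems:
    """Placeholder transformation: replace the source track by a target track.

    A real system would generate new audio with the target timbre and the
    same melody; here the samples are simply carried over under a new name.
    """
    out = {name: samples for name, samples in stems.items() if name != source}
    out[target] = stems[source]
    return out


def mix_down(stems: Stems) -> List[float]:
    """Recombine all tracks of one song into a single sum signal."""
    return [sum(column) for column in zip(*stems.values())]


def mix(track_a: List[float], track_b: List[float]) -> List[float]:
    """Mix the two recombined tracks in parallel."""
    return [a + b for a, b in zip(track_a, track_b)]


# Decomposed tracks as in the example of FIG. 2 (sample values are made up).
song_a = {"piano": [0.5, 0.5], "bass": [0.25, 0.25], "drums": [0.125, 0.0]}
song_b = {"vocal": [0.5, 0.0], "bass": [0.0, 0.25], "drums": [0.125, 0.125]}

stems_a = transform_timbre(song_a, "piano", "trumpet")
stems_b = transform_timbre(song_b, "vocal", "trumpet")
recombined_a = mix_down(stems_a)
recombined_b = mix_down(stems_b)
output_track = mix(recombined_a, recombined_b)
```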

FIG. 3 shows an embodiment of a user control unit, which may be used as user control unit 24 in the device 10 of the embodiment described with respect to FIG. 1. It should, however, be noted that other suitable types and configurations of control units may be used to allow a user to control device 10.

User control unit 24 may be configured as a DJ application running on a suitable device, for example a tablet computer 26. Control elements and status information about an operational condition of device 10 may be displayed on a display 28 of the tablet computer 26, which is preferably a touchscreen accepting user input via touch gestures in order to allow a user to activate, move or otherwise manipulate control elements as will be described in more detail below. However, the application could run on any other suitable device, such as a computer, a smartphone or a standalone digital DJ device.

An example layout of the DJ application is illustrated in FIG. 3. The layout is basically divided to show information and control elements relating to a song A in a left part of the layout, and information and control elements related to a different song B in the right part of the layout. Starting with the left part of the layout relating to a song A, a song-select section 30A is configured to allow a user to select a song A from a music library, for example from a music streaming provider for streaming via the Internet or from a local storage device. A song information section 32A displays information about the selected song A, such as a song name, a waveform representation 34A, a play head 36A identifying the current playback position within song A, or other information.

Furthermore, an effect control element 38A may be provided to control one or more sound effects to be applied to song A. In an example, effect control element 38A may be a scratch control element such as a virtual vinyl, which can be controlled by a user to simulate a scratching effect (controlling playback in accordance with manual rotation of the vinyl).

Furthermore, a play/stop control element 40A may be provided to start or pause playback of song A with the touch of a button.

For controlling song B or showing information about song B, the right part of the layout of the DJ application may comprise the same or corresponding control elements as described above for song A. In particular, song-select section 30B may allow a user to select a song, a song information section 32B may display information about song B such as a name, a waveform representation 34B and a play head 36B, and one or more effect control elements 38B and/or a play/stop control element 40B may be provided to control effects to be applied to song B and to control transport of song B, respectively.

The layout of the DJ application of user control unit 24 further includes a decomposition and transformation section 42 for controlling several functions relating to an interaction between songs A and B. In particular, in the present example, the first decomposition unit 14A (FIG. 1) is configured to decompose the first input track relating to song A, such as to provide a vocal-A track, a harmonic-A track and a drums-A track, which contain the respective vocal, harmonic and drum components contained in song A. Likewise, the second decomposition unit 14B is configured to decompose the second input track relating to song B, such as to provide a vocal-B track, a harmonic-B track and a drums-B track, which contain the respective vocal, harmonic and drum components contained in song B.

It should be noted that the harmonic-A track and the harmonic-B track may each comprise the sum of all remaining timbres included in song A or song B, respectively, i.e. the timbres obtained after subtracting the respective vocal and drums timbres from the original input track. Depending on the composition of the particular song, the harmonic timbre may therefore include a sum of several instrumental timbres, such as guitar timbres, piano timbres, synthesizer timbres, etc. Furthermore, it should be noted that the separation of the songs into vocal, harmonic and drums timbres is used as an example in the current embodiment, while the decomposition units may be configured to provide any other number or types of decomposed tracks including other timbres as desired.
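The relation between the harmonic track and the other decomposed tracks can be illustrated by a sample-wise subtraction. This is a simplifying assumption for the example: a decomposition unit may equally estimate the remainder directly, and the function name `residual_track` is hypothetical.

```python
from typing import List


def residual_track(input_track: List[float], *separated: List[float]) -> List[float]:
    """Subtract the separated tracks from the input track, sample by sample."""
    for track in separated:
        if len(track) != len(input_track):
            raise ValueError("all tracks must have the same length")
    return [
        sample - sum(track[i] for track in separated)
        for i, sample in enumerate(input_track)
    ]


# Made-up sample values for a short excerpt of song A.
input_a = [1.0, 0.5, -0.25]
vocal_a = [0.5, 0.25, 0.0]
drums_a = [0.25, 0.0, -0.25]

harmonic_a = residual_track(input_a, vocal_a, drums_a)

# In a complete decomposition the three tracks add up to the input again.
reconstructed = [v + h + d for v, h, d in zip(vocal_a, harmonic_a, drums_a)]
```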

A harmonic cross-fader 44H may be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of harmonic-A track and harmonic-B track. Thus, by operating the harmonic cross-fader 44H, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to harmonic-A track and a volume level assigned to harmonic-B track may be changed. In particular, harmonic cross-fader 44H can be controlled by a user within a control range having one end point at which the volume level assigned to harmonic-A track is maximum and the volume level assigned to harmonic-B track is minimum, and a second end point at which the volume level assigned to harmonic-B track is maximum and the volume level assigned to harmonic-A track is minimum.

In addition, a drums cross-fader 44D may be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of drums-A track and drums-B track. Thus, by operating the drums cross-fader 44D, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to drums-A track and a volume level assigned to drums-B track may be changed. In particular, drums cross-fader 44D can be controlled by a user within a control range having one end point at which the volume level assigned to drums-A track is maximum and the volume level assigned to drums-B track is minimum, and a second end point at which the volume level assigned to drums-B track is maximum and the volume level assigned to drums-A track is minimum.
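A cross-fader such as 44H or 44D maps a single control position onto two volume levels. The sketch below assumes a normalized position between 0.0 (song A end point) and 1.0 (song B end point) and shows two common gain laws; the function name, the position convention and the choice of laws are assumptions made for illustration.

```python
import math
from typing import Tuple


def crossfader_gains(position: float, law: str = "linear") -> Tuple[float, float]:
    """Map a fader position in [0.0, 1.0] to (gain for A, gain for B).

    position 0.0: track A at maximum, track B at minimum.
    position 1.0: track B at maximum, track A at minimum.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0.0 and 1.0")
    if law == "linear":
        return 1.0 - position, position
    if law == "equal_power":
        # Keeps gain_a**2 + gain_b**2 constant across the control range.
        angle = position * math.pi / 2
        return math.cos(angle), math.sin(angle)
    raise ValueError("unknown fader law: " + law)


gain_a, gain_b = crossfader_gains(0.25)
```

The two gains would then be applied as volume levels to, for example, the harmonic-A track and the harmonic-B track before they are summed.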

Furthermore, user control unit 24 may comprise a first substitute section 46A associated to song A, and a second substitute section 46B associated to song B. Each substitute section 46A, 46B may allow a user to select one of a plurality of substitute timbres for substituting the timbre of the vocal-A track or the vocal-B track as desired. In the present example, each substitute section provides three substitute timbres: piano, flute and trumpet. Selecting one of the substitute timbres controls the first or second transforming unit 16A, 16B (FIG. 1) such as to generate, based on the vocal-A track or the vocal-B track and the selected substitute timbre, a first or a second transformed track, respectively, wherein the transformed track has the same melody as the original vocal-A track or vocal-B track, but has a timbre according to the selected substitute timbre.

A substitute cross-fader 48A may be provided for controlling a volume level assigned to vocal-A track and a volume level assigned to the first transformed track, in particular a ratio between both volume levels. The substitute cross-fader 48A may be controllable by a user, preferably with only one finger and a single control movement, within a control range between a first end point at which the volume assigned to the first transformed track is maximum and the volume assigned to vocal-A track is minimum, and a second end point at which the volume assigned to vocal-A track is maximum and the volume assigned to the first transformed track is minimum. Alternatively, a simple track selector for selecting either vocal-A track or first transformed track may be used instead of the substitute cross-fader 48A.

In a corresponding way as described for song A, a substitute cross-fader 48B may be provided to control a ratio between a volume level assigned to the second transformed track, whose timbre is selected via substitute section 46B, and a volume level assigned to the vocal-B track. Alternatively, a second track selector for selecting either vocal-B track or the second transformed track may be used.

According to the settings of the substitute cross-faders 48A and 48B or, alternatively, the settings of the track selectors, a transformed vocal-A track and a transformed vocal-B track will thus be obtained, which contain either the unchanged decomposed vocal tracks as obtained from the decomposition units 14A, 14B (cross-faders 48A and 48B moved fully towards vocal), or which contain only the transformed tracks (cross-faders 48A and 48B moved fully towards substitute), or which contain a mixture of the decomposed vocal tracks and the transformed tracks (cross-faders 48A and 48B between end points).

Finally, a vocal cross-fader 44V may be provided in the decomposition and transformation section 42 as a further control element, which allows cross-fading between playback of transformed vocal-A track and transformed vocal-B track. Thus, by operating the vocal cross-fader 44V, which may be done with only one finger or in a single movement using a single control element, a ratio between a volume level assigned to transformed vocal-A track and a volume level assigned to transformed vocal-B track may be changed. In particular, vocal cross-fader 44V can be controlled by a user within a control range having one end point at which the volume level assigned to transformed vocal-A track is maximum and the volume level assigned to transformed vocal-B track is minimum, and a second end point at which the volume level assigned to transformed vocal-B track is maximum and the volume level assigned to transformed vocal-A track is minimum.

In the configuration shown in FIG. 3, for example, the control elements 44 to 48 are set in such a manner as to play both songs A and B in a mix, wherein the drums of song B are set to a higher volume level than the drums of song A and the harmonic components of songs A and B are set to have equal volume levels. Furthermore, for song A the vocal component is also played, while for song B the vocal component is substituted by a piano track having the same melody as the original vocal component of song B. The piano track has the same volume level as the vocal component of song A. In order to improve the mix and possibly perform a transition between the songs, a user could in the next step move the substitute cross-fader 48A towards substitute, i.e. the transformed track, so as to substitute the vocal-A track by a flute track of the same melody. Afterwards, the second substitute cross-fader 48B could be moved towards the vocal-B track, which is then mixed with the flute track of song A. At a later point in time, all cross-faders 44V, 44H and 44D could be moved towards song B so as to complete the transition.
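The signal routing implied by this configuration may be sketched as a two-stage mix: the substitute cross-faders 48A, 48B first blend each decomposed vocal track with its transformed track, and the cross-faders 44V, 44H and 44D then blend the corresponding tracks of songs A and B before summation. The sketch below is a non-authoritative illustration; the linear blend, the dictionary keys and the numeric fader values chosen for the FIG. 3 state (for example 0.75 for "drums of song B louder than drums of song A") are assumptions.

```python
from typing import Dict, List

Stem = List[float]


def blend(a: Stem, b: Stem, position: float) -> Stem:
    """Linear cross-fade: position 0.0 gives only a, 1.0 gives only b (assumed law)."""
    p = min(max(position, 0.0), 1.0)
    return [(1.0 - p) * x + p * y for x, y in zip(a, b)]


def mix(song_a: Dict[str, Stem], song_b: Dict[str, Stem],
        faders: Dict[str, float]) -> Stem:
    """Combine the decomposed and transformed tracks of two songs into one output."""
    # Stage 1: substitute cross-faders (0.0 = original vocal, 1.0 = substitute).
    vocal_a = blend(song_a["vocal"], song_a["substitute"], faders["48A"])
    vocal_b = blend(song_b["vocal"], song_b["substitute"], faders["48B"])
    # Stage 2: per-component cross-faders (0.0 = song A, 1.0 = song B).
    vocal = blend(vocal_a, vocal_b, faders["44V"])
    harmonic = blend(song_a["harmonic"], song_b["harmonic"], faders["44H"])
    drums = blend(song_a["drums"], song_b["drums"], faders["44D"])
    # Stage 3: sum the three components into the output track.
    return [v + h + d for v, h, d in zip(vocal, harmonic, drums)]


# Assumed numeric reading of the FIG. 3 state: vocal A unchanged, vocal B fully
# substituted, vocal and harmonic components balanced, drums biased towards B.
FIG3_FADERS = {"48A": 0.0, "48B": 1.0, "44V": 0.5, "44H": 0.5, "44D": 0.75}

# End state of the described transition: original vocal B, all faders at song B.
TRANSITION_DONE = {"48A": 1.0, "48B": 0.0, "44V": 1.0, "44H": 1.0, "44D": 1.0}
```

Moving from `FIG3_FADERS` to `TRANSITION_DONE` in small steps over time would correspond to the transition described above, in which song A is gradually replaced by song B.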

1. A method for processing audio data, comprising the steps of: providing input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre; decomposing the input audio data to provide decomposed data representative of the first audio data; and transforming the decomposed data to obtain third audio data, wherein transforming the decomposed data includes at least one of: i. changing musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, or ii. changing melody, such that the third audio data represents a melody different from that of the decomposed data.
 2. The method of claim 1, wherein transforming the decomposed data includes changing musical timbre and wherein the third audio data and the decomposed data represent the same melody or represent no melody.
 3. The method of claim 1, wherein the third audio data and the decomposed data have at least one of equal key or equal time-dependent harmony.
 4. The method of claim 1, wherein the third audio data and the decomposed data have the same timbre.
 5. The method of claim 1, wherein changing melody includes: detecting a time dependent musical harmony of the input audio data or decomposed data; and generating, based on the time dependent musical harmony, pitch data for a plurality of individual musical tones of the third audio data, which are to be played sequentially at respective predetermined points in time to generate a melody of the third audio data.
 6. The method of claim 1, further comprising detecting pitch data indicating musical pitches of the decomposed data or the first audio data.
 7. The method of claim 1, further comprising a step of converting the decomposed data into event message data formed by a plurality of event messages of musical tones, wherein each event message at least specifies pitch data and velocity data of a corresponding musical tone.
 8. The method of claim 7, wherein the step of transforming the decomposed data includes synthesizer-based or sample-based generation of audio signals based on the pitch data or the event message data.
 9. The method of claim 1, wherein the step of transforming the decomposed data involves processing of audio data obtained from the decomposed data within an artificial intelligence system.
 10. The method of claim 1, wherein the step of decomposing the input audio data involves processing of audio data obtained from the input audio data within an artificial intelligence system.
 11. The method of claim 1, wherein the step of decomposing the input audio data provides first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and wherein the method further comprises a step of recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.
 12. The method of claim 1, wherein the step of decomposing the input audio data provides a plurality of sets of decomposed data, wherein each set of decomposed data represents audio data of a predetermined musical timbre, such that a sum of all sets of decomposed data represents audio data substantially equal to the input audio data.
 13. The method of claim 1, wherein the input audio data are provided in the form of at least one input track formed by a plurality of audio frames, the input track having an input track length and each audio frame having an audio frame length, and wherein the step of decomposing the input audio data comprises decomposing a plurality of input track segments each having a length smaller than the input track length and larger than the audio frame length.
 14. The method of claim 13, wherein decomposing the plurality of input track segments obtains a plurality of decomposed track segments; wherein transforming the decomposed data is based on the plurality of decomposed track segments to obtain a plurality of third audio track segments; and wherein a first segment of the third audio track segments is obtained before a second segment of the input track segments is being decomposed.
 15. The method of claim 1, wherein obtaining at least a segment of the third audio data is completed within a processing time of less than about 10 seconds after providing the input audio data or a segment of the input audio data associated with the segment of the third audio data.
 16. The method of claim 1, wherein the step of decomposing the input audio data provides first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; and wherein the method further comprises a step of generating an output track which includes a first output track portion, which is obtained by recombining audio data obtained from the first decomposed data with audio data obtained from the second decomposed data or which includes the input audio data, and a second output track portion which is obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.
 17. The method of claim 16, wherein the output track includes a transition portion between the first output track portion and the second output track portion, wherein, within the transition portion, in a direction from the first output track portion towards the second output track portion, a first volume level associated to the audio data obtained from the first decomposed data decreases and a second volume level associated to the audio data obtained from the third audio data increases.
 18. The method of claim 1, wherein the input audio data are provided as a first input audio data containing a first piece of music; and wherein the method further comprises the steps of: providing second input audio data containing a second piece of music different from the first piece of music, mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and playback of audio data obtained from the mixed audio data.
 19. The method of claim 18, wherein the second input audio data contains a mixture of audio data including fourth audio data of a fourth musical timbre and fifth audio data of a fifth musical timbre; and wherein the method further comprises the steps of: decomposing the first input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data, and decomposing the second input audio data to provide fourth decomposed data representative of the fourth audio data, and fifth decomposed data representative of the fifth audio data.
 20. The method of claim 18, wherein the step of transforming the first decomposed data to obtain the third audio data includes changing the musical timbre such that the third musical timbre substantially equals a musical timbre of the second input audio data.
 21. The method of claim 19, further including transforming the fourth decomposed data to obtain sixth audio data.
 22. The method of claim 21, wherein the third audio data and the sixth audio data are of substantially the same timbre.
 23. The method of claim 21, wherein the method further comprises a step of generating an output track which includes a first output track portion obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data, and a second output track portion obtained by recombining audio data obtained from the sixth audio data with audio data obtained from the fifth decomposed data.
 24. The method of claim 1, wherein the input audio data are obtained from mixing a plurality of sets of source audio data including the first audio data and the second audio data; and wherein the first audio data are generated by or recorded from a first source selected from a first musical instrument, a first software instrument, a first synthesizer and a first vocalist, and the second audio data are generated by or recorded from a source selected from a second musical instrument, a second software instrument, a second synthesizer and a second vocalist.
 25. A device for processing audio data, comprising an input unit configured to receive input audio data containing a mixture of audio data including first audio data of a first musical timbre and second audio data of a second musical timbre different from said first musical timbre; a decomposition unit for decomposing the input audio data to provide decomposed data representative of the first audio data; and a transforming unit for transforming the decomposed data to obtain third audio data, wherein the transforming unit includes at least one of: i. a timbre changing unit configured to change musical timbre such that the third audio data are of a third musical timbre different from the first musical timbre, or ii. a melody changing unit configured to change melody such that the third audio data represent a melody different from that of the decomposed data.
 26. The device of claim 25, wherein the third audio data and the first audio data represent musical tones of the same melody, or wherein the third audio data and the first audio data have equal harmony.
 27. The device of claim 25, wherein the melody changing unit comprises: a harmony detection unit for detecting a time dependent musical harmony of the first audio data; and a pitch data generating unit for generating pitch data for a plurality of individual musical tones of the third audio data, which are to be played sequentially at respective predetermined points in time to generate a melody of the third audio data.
 28. The device of claim 27, further comprising a pitch detection unit for detecting pitch data indicating a musical pitch of the decomposed data or the first audio data.
 29. The device of claim 28, further comprising a data conversion unit for converting the decomposed data into event message data formed by a plurality of event messages of musical tones, wherein each event message at least specifies pitch data and velocity data of a corresponding musical tone.
 30. The device of claim 29, further comprising at least one of a synthesizer unit for synthesizer-based generation of audio signals based on the pitch data or the event message data, or a sample player for sample-based generation of audio signals based on the pitch data or the event message data.
 31. The device of claim 29, wherein at least one of the transforming unit, the timbre changing unit, the melody changing unit, the harmony detection unit, the pitch detection unit, the pitch data generating unit, and the data conversion unit comprises an artificial intelligence system.
 32. The device of claim 25, wherein the decomposition unit comprises an artificial intelligence system.
 33. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; and wherein the device further comprises a recombination unit for recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data to obtain recombined audio data.
 34. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide a plurality of sets of decomposed data, wherein each set of decomposed data represents audio data of a predetermined musical timbre, such that a sum of all sets of decomposed data represents audio data substantially equal to the input audio data.
 35. The device of claim 25, wherein the input audio data contain an input track formed by a plurality of audio frames, the input track having an input track length and each audio frame having an audio frame length, and wherein the decomposition unit is adapted to decompose the input audio data by decomposing a plurality of input track segments each having a length smaller than the input track length and larger than the audio frame length.
 36. The device of claim 35, wherein the decomposition unit is configured to decompose the plurality of input track segments to obtain a plurality of decomposed track segments; wherein the transforming unit is configured to generate the third audio data based on the plurality of decomposed track segments to obtain a plurality of third audio track segments; and wherein the device is configured to obtain a first segment of the third audio track segments before a second segment of the input track segments is being decomposed.
 37. The device of claim 36, wherein the device is configured such that obtaining at least a segment of the third audio data is completed within a processing time of less than about 10 seconds after providing the input audio data or a segment of the input audio data associated to the segment of the third audio data.
 38. The device of claim 25, wherein the decomposition unit is configured to decompose the input audio data to provide first decomposed data representative of the first audio data, and second decomposed data representative of the second audio data; wherein the device further comprises an output unit generating an output track which includes a first output track portion, which is obtained by recombining audio data obtained from the first decomposed data with audio data obtained from the second decomposed data or which substantially includes the input audio data, and a second output track portion obtained by recombining audio data obtained from the third audio data with audio data obtained from the second decomposed data; and wherein the device further comprises a user control unit for receiving a user control input determining a starting point or an end point of the first output track portion or the second output track portion.
 39. The device of claim 38, wherein the user control unit includes a crossfader for setting a ratio between a first volume level associated to the audio data obtained from the first decomposed data and a second volume level associated to the audio data obtained from the third audio data.
 40. The device of claim 25, wherein the input unit has a first input section for receiving first input audio data containing a first piece of music, and a second input section for receiving second input audio data containing a second piece of music different from the first piece of music; and wherein the device further comprises: a mixing unit configured for mixing audio data obtained from the first input audio data with audio data obtained from the second input audio data to obtain mixed audio data, and a playback unit for playing audio data obtained from the mixed audio data. 